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We point out that interesting features in high energy physics data can be determined from prop¬ 
erties of Voronoi tessellations of the relevant phase space. For illustration, we focus on the detection 
of kinematic “edges” in two dimensions, which may signal physics beyond the standard model. After 
deriving some useful geometric results for Voronoi tessellations on perfect grids, we propose several 
algorithms for tagging the Voronoi cells in the vicinity of kinematic edges in real data. We show that 
the efficiency is improved by the addition of a few Voronoi relaxation steps via Lloyd’s method. By 
preserving the maximum spatial resolution of the data, Voronoi methods can be a valuable addition 
to the data analysis toolkit at the LHC. 
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Introduction. Experimental searches for new physics 
(NP) are ultimately searches for “features” in the data. 
In high energy physics, the data is represented by a col¬ 
lection of “events”, which are distributed in phase space, 
V, according to the fully differential cross-section 

S-/(x-{«»- (I) 

Here x G V is a particular phase space point, typically 
labelled by momentum components of final state parti¬ 
cles, while {o} is a set of model parameters, e.g., particle 
masses, widths, couplings, etc. Thus, the distribution of 
events in phase space is nothing but a Monte Carlo sam¬ 
pling of the function 0 . which generally consists of two 
contributions: 

f{x, {a}) = fsM{x, {asM}) + fNp{x, {aNp})^ (2) 

Here fsM is the distribution expected from Standard 
Model (SM) processes, a.k.a. “the background”, while 
/atp describes possible new physics, i.e., “the signal”. 

The traditional method to search for new physics is in 
counting experiments, where one measures the total num¬ 
ber of events in a suitably chosen region of phase space, 
Vq. New physics then manifests itself as an excess over 
the SM expectation fsM{£, {o^sm}) dx. However, a 
much more powerful approach is to look at the differ¬ 
ential properties of the observed events in phase space, 
and attempt to identify structural features in their dis¬ 
tributions, which might be present in /atp, but not in 
fsM- An example of this method is the bump-hunting 
technique in resonance searches, where the Breit-Wigner 
peak in /atp “stands out” over the smooth background 
described by fsM- 

The situation gets much more complicated if some of 
the decay products (e.g., neutrinos or dark matter parti¬ 
cles) are invisible in the detector. Even then, the signal 
distribution /atp often exhibits some special features [39]. 
e.g., kinematic endpoints ms], kinematic boundaries 
iini, kinks [HHIS] and cusps [TBHTQ], which are absent 


in the background distribution fsM- This greatly moti¬ 
vates the development of new methods for reliable iden¬ 
tification of such features in the data, which would be 
tantamount to a new physics discovery. 

In this letter, we focus on two-dimensional high energy 
particle physics data, leaving the straightforward gen¬ 
eralization to higher dimensions to a future study m- 
We assume that the signal distribution exhibits a char¬ 
acteristic “edge”, along which f^p is discontinuous, as 
is typical of a kinematic boundary in phase space. Edge 
detection is a well-studied problem in the experimental 
and observational sciences [21]. There are, however, sev¬ 
eral challenges for edge detection in particle physics that 
may frustrate standard approaches, namely 

1. The data may he relatively sparse. Most work in 
edge detection has focused on images, where there is a 
data point at each pixel. By contrast, in particle physics 
we may want to discover new physics with a relatively 
small number of signal events, when large regions of 
phase space may remain unpopulated. 

2. We may not know analytieally the elass of distri¬ 
butions, fsM + /atp; that deseribe the data. If we know 
the parametric form of the distribution ([^, likelihood 
methods can be used to determine edges. However, it is 
generally difficult to obtain an exact analytical form for 
fsM^ particularly in the case of reducible backgrounds, 
where detector effects play a major role. We may also 
wish to be sensitive to “surprises” in the data with re¬ 
gards to new physics — after all, we cannot be sure, a 
priori, that we have correctly guessed the specific new 
physics model [22]. Even if we have some idea of where 
the new physics edges may be found, a general procedure 
may still be of greater practical use. 

3. The data may be in more than two dimensions. 
As we mentioned above, edge detection is generally ap¬ 
plied to two-dimensional images. However, multivari¬ 
ate analyses [23] are ubiquitous in particle physics; in 
general, we will face the problem of finding an (n — 1)- 
dimensional kinematic boundary in an n-dimensional pa¬ 
rameter space. 
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FIG. 1: Left: a traditional two-dimensional binned his¬ 

togram for a planar stochastic point process with 2000 events 
generated according to the distribution §. Right: the 
Voronoi tessellation of the same data. 


The class of methods we propose can handle all three 
of these challenges, making them an important addi¬ 
tion to the experimentalist’s toolkit for Run 2 at the 
CERN Large Hadron Collider (LHC). The starting point 
of our analysis is the Voronoi tessellation [40] of our two- 
dimensional data, where each “event” i is treated as the 
corresponding generator point for the Tth Voronoi poly¬ 
gon [26]. Voronoi tessellations have been successfully 
used in various areas of science, including condensed mat¬ 
ter physics EH, astronomy [28] and astrophysics [291130], 
but so far have not been applied to data analysis in high 
energy physics. im At the same time, one can think of 
the outcome from any high-energy collider experiment as 
an inhomogeneous Poisson point process with an inten¬ 
sity function 0 . Voronoi methods were designed for the 
analysis of precisely such processes. 

The Voronoi approach is ideally suited for finding inter¬ 
esting (e.g., singular) features in /atp, since it preserves 
the maximum spatial resolution in the data [32]. To see 
this, consider a toy example where 2000 data “points” 
(x, y) are generated according to the periodic function 

f{x) = 1 + sin ^67ry/x^ + . (3) 

The standard approach is to bin the data, e.g., as shown 
in the left panel of Fig. [^ It is clear that interesting 
features of the underlying distribution are being lost as 
a result of averaging within each bin. In contrast, the 
Voronoi tessellation of the same data, shown in the right 
panel of Fig.[^ clearly displays the radial periodicity and 
rotational symmetry of the data. Furthermore, by con¬ 
struction, the areas of the Voronoi polygons serve as 
local estimators of the values of the generating function 
f{x) at the location Xi of each generator point pi 

f{xi) ~ I/Ui. (4) 

One can further interpolate between the generator points 
by the method of natural neighbor interpolation [26] . 

Voronoi Methods for Edge Detection. Since we 
do not assume the exact knowledge of /atp, we do not 
attempt here to reconstruct the function itself, but focus 



FIG. 2: The Voronoi tessellation of 1400 data points dis¬ 
tributed according to the probability density 0 with p = 6. 
The Voronoi polygons have been color-coded by their area 
(upper left), perimeter (upper right), number of neighboring 
polygons (lower left) or scaled variance ^ (lower right). 


instead on finding edge features such as discontinuities 
[33] . Edge detection algorithms for binned data readily 
exist EH; our methods will apply to the corresponding 
Voronoi tessellation and include the following steps: 

1. Construct the Voronoi tessellation for the data set. 


2. Compute relevant attributes of the Voronoi cells. 

3. (Optionally) use the information from the previous 
step to further process the data in some way. 


4. Use some criterion to flag “candidate” edge cells. 

We can gain useful intuition from a toy example illus¬ 
trating this procedure. Consider the probability distri¬ 
bution within the unit square, 

/(^> y) = [pH{0.5 - x) + H{x - 0.5)] , (5) 

1 + p 


where H{x) is the Heaviside step function and p is a 
constant density ratio. We generate 1400 points and 
show the resulting Voronoi tessellation in Fig. [^ where 
the Voronoi polygons have been color-coded according to 
some standard properties, e.g., area, perimeter, or num¬ 
ber of immediate neighbors. Our goal will be to investi¬ 
gate the vertical edge at x = 0.5 (yellow solid line), which 
divides the square into left (L) and right (R) regions of 
constant, but unequal densities. [42 

The first three panels of Fig. |2 reveal that: (a) the 
areas of the Voronoi polygons scale in accordance with 
0; (b) the perimeter is somewhat correlated with the 
area, but the connection is not as direct as in Q; (c) the 
typical number of neighbors is similar in the two bulk 
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FIG. 3: A perfect lattice grid 0 for r = 3 (upper left), and 
dependence on r for several parameters of interest: cell area 
(upper right), cell perimeter (lower left) and scaled variance 
(lower right). Black circles denote bulk cells, while blue x 
(red +) symbols indicate edge cells in the L (R) region. 


regions. Neither of these three quantities is able to char¬ 
acteristically pick out the edge cells (outlined in black). 
This motivates us to introduce the scaled variance of the 
areas of the neighboring cells, 


(Ja _ 1 {an — a)^ 

Y ^ Wd - 1 ’ 

\neNi I 


(6) 


where Ni is the set of neighbors of the i-th Voronoi poly¬ 
gon, and a{Ni) is their mean area. The lower right panel 
in Fig. shows that the scaled variance is quite successful 
in identifying edge cells, thus we shall choose §) as our 
main selection variable [43]. 

The results from the stochastic process shown in Fig.|^ 
and in particular the success of the variable in 0 , can 
be understood analytically if we analyze a perfect lattice 
which mimics the probability distribution (§. The grid 
is generated by two integers, n and m, as 


R= [{n-\- 0.5) X + (m + 0.5) y] [H{—n) + rH{n )], (7) 


where the vectors x and y form an orthonormal basis 
and r = y/p is the corresponding linear density ratio. An 
example grid for r = 3 is shown in the upper left panel of 
Fig.i In particular, we focus on the two columns of edge 
cells: in the L region (blue x symbols) and the R region 
(red + symbols). The remaining three panels in Fig. 
show plots of some of their properties as a function of r. 
We see that both the area and the perimeter of the edge 
cells are slightly modified from their nominal values in the 
bulk, but remain in between the two extreme bulk values. 
On the other hand, the scaled variance is identically zero 
for both bulk regions, and monotonically increasing with 
r for the edge region. 




FIG. 4: Evolution of the Voronoi tessellation from Fig. 
after one (left panel) and five (right panel) Lloyd relaxation 
steps. The cells are color coded by the scaled variance 0 - 


Voronoi relaxation via Lloyd’s algorithm. Since 
we are dealing with a stochastic process, statistical fluc¬ 
tuations are inevitably present in the data. In particular, 
the lower right panel in Fig. [^reveals pockets of bulk cells 
with relatively high values of (Ja/a. Here we propose to 
filter out such extraneous cells by first applying a few 
iterations of Lloyd’s algorithm [35], where at each itera¬ 
tion, the generator point is moved to the centroid of the 
corresponding Voronoi cell. [44] The left (right) panel in 
Fig. El shows the result of this operation after one (five) 
such Lloyd iterations. As expected, relaxation causes the 
Voronoi polygons to become more regularly shaped}^. 
More importantly, the fluctuations within the bulk re¬ 
gions are washed out, thus increasing the contrast be¬ 
tween edge cells and bulk cells. 

Fig. [^reveals that Voronoi relaxation causes a net flow 
of the data points from the dense L region towards the 
sparse R region. As a result, the edge cells with high (Ja/a 
have moved away from their original locations (near the 
vertical yellow line). This is why, if we were to tag edge 
candidates after a certain number of Lloyd iterations, we 
need to trace them back to their original locations before 
doing any further quantitative data analysis. 

The left panel in Fig. takes a closer look at one rep¬ 
resentative area near the edge and shows the result of 
several successive Lloyd iterations. In general, each gen- 




FIG. 5: Left: a zoomed-in region near the vertical edge, 
showing the original generator points, and their subsequent 
locations in the course of several Lloyd iterations. The points 
are color-coded according to the scaled displacement, 

Right: Predictions for dij^fai for the first Lloyd step, as a 
function of r, for the case of a perfect lattice grid 0 - 
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FIG. 6: Left: ROC curves ss{sb) using ^ as the discrimi¬ 
nating variable. Right: The corresponding Gini index ^ as 
a function of the number of Lloyd iterations. 


erator point i is displaced a certain distance di from its 
original location. It is interesting to note that the edge 
points appear to be displaced the farthest, which is easy 
to understand in terms of simple diffusion. This observa¬ 
tion suggests another criterion for selecting edge cells — 
based on their displacement during the Lloyd relaxation. 
For convenience, we define a dimensionless quantity, the 
scaled displacement where we normalize by the 

square root of the cell area, a^. The color coding in the 
left panel in Fig. [^demonstrates that the scaled displace¬ 
ment could be a useful alternative to the scaled variance 
This is confirmed for the case of the perfect gri d Q 
by the exact results shown in the right panel of Fig. 

Having defined a selection algorithm for edge cells, it is 
prudent to study quantitatively the efficiency of the algo¬ 
rithm, e.g., in terms of ROC curves [36]. For this purpose, 
we generate high statistics samples for ([^, where we treat 
the edge cells as “signal” and the bulk cells as “back¬ 
ground”. We then plot the signal selection efficiency, 55 , 
versus the background efficiency, cb, for different val¬ 
ues of the minimum cut on the variable In the left 
panel of Fig. [^ we show several es{£B) curves, for differ¬ 
ent values of the density ratio p, and either with (solid) or 
without (dashed) Lloyd relaxation. We see that the algo¬ 
rithm works better for higher density contrasts between 
the two regions. Note also the significant improvement 
as a result of adding the Voronoi relaxation. 

In order to quantify the accuracy of our selection of 
edge cells, we use the standard area under the curve 
(AUROC) as represented by the Gini coefficient 

Gi = 2 AUROC -1 = 2 / dsBX ss{sb) - 1, (8) 

Jo 

where a value of 1 is obtained from the ROC curve of a 
perfectly discriminating variable, while a value of 0 corre¬ 
sponds to a totally random selection of events. The right 
panel of Fig. [^ shows the dependence of Gi on the num¬ 
ber of Lloyd steps. We see that the accuracy improves 
very quickly within the first few iterations, and reaches 
an optimum plateau, after which the power of the test is 



FIG. 7: Various Voronoi tessellations of the data from the 
supersymmetry example considered in the text. 


degraded as the Voronoi grid begins to asymptote to the 
centroidal tessellation. 

An example from supersymmetry. As an appli¬ 
cation of the proposed edge detection method, we con¬ 
sider a standard benchmark example from supersymme¬ 
try, squark pair production at the 13 TeV LHC. For 
simplicity, we focus on asymmetric events in which one 
squark undergoes a long cascade decay through a heavy 
neutralino, X 2 ’^ ^ slepton, i; and a light neutralino, Xi'^ 
while the other decays directly to Xi- The mass spec¬ 
trum was chosen as rriq = 400 GeV, m^o = 300 GeV, 
= 280 GeV, and m^o = 200 GeV. The observed fi¬ 
nal state particles are two jets and two leptons, whose 
invariant mass distributions have been well studied and 
are known to exhibit kinematic edges. Here we focus on 
the dilepton invariant mass, mu, and the three-body jet- 
lepton-lepton invariant mass, mju. With the correct jet 
assignment, signal events are constrained to the region 
outlined by the solid black line in Fig. , where 

for plotting convenience we use the {mj^, (m^^^ — m|^)/ 6 ) 
plane. Since we cannot measure the charge of the jet, 
there is a two-fold combinatorial ambiguity, thus the plot 
contains two entries per event. We also include the main 
SM background from tt dilepton events. 

In the upper two panels of Fig. [^ the Voronoi cells are 
color coded by their scaled variance The left panel is 
the original data, while the right panel includes 5 Lloyd 
steps. In the lower left panel we reconsider the original 
data, but extend the calculation of to include up to 5 
tiers of nearest neighbors. We see that both Voronoi re¬ 
laxation as well as including more tiers of neighbors have 
the benefit of reducing the fluctuations and sharpening 
the edge. The two procedures can also be done simulta¬ 
neously — the lower right panel of Fig.[^shows the result 
after 3 Lloyd steps and including 3 tiers of neighbors. 
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Summary. In this letter, we have argued that the dis¬ 
covery of new kinematic features is an essential step in 
the discovery of physics beyond the standard model and 
have advocated the use of Voronoi methods for this pur¬ 
pose. The great flexibility of Voronoi methods is a bless¬ 
ing for the experimentalist; many useful properties of the 
Voronoi cells can be used to construct powerful variables 
tailored to specific new physics scenarios. A voluminous, 
quantitative study of the many options available to the 
experimenter will be presented in a companion paper [20] . 
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